Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details through the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-art methods on benchmark datasets demonstrate that the proposed method achieves significant improvements in generating photo-realistic images conditioned on text descriptions.
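The Conditioning Augmentation idea can be sketched as sampling the conditioning vector from a Gaussian whose mean and variance are predicted from the text embedding, with a KL penalty toward the standard normal to keep the conditioning manifold smooth. The following is a minimal illustrative sketch, not the paper's implementation: the embedding dimensions (1024 → 128), the random linear layers, and the numpy setup are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 1024-d text embedding compressed to a 128-d
# conditioning vector (the abstract does not specify dimensions).
EMB_DIM, COND_DIM = 1024, 128

# Stand-ins for the learned fully connected layers that predict the mean
# and log-variance of the Gaussian conditioning distribution.
W_mu = rng.normal(0.0, 0.01, (COND_DIM, EMB_DIM))
W_logvar = rng.normal(0.0, 0.01, (COND_DIM, EMB_DIM))

def conditioning_augmentation(phi_t):
    """Sample c ~ N(mu(phi_t), diag(sigma(phi_t)^2)) with the
    reparameterization trick, and return the KL(N(mu, sigma^2) || N(0, I))
    term that regularizes the conditioning manifold toward smoothness."""
    mu = W_mu @ phi_t
    logvar = W_logvar @ phi_t
    sigma = np.exp(0.5 * logvar)
    eps = rng.standard_normal(COND_DIM)
    c = mu + sigma * eps  # differentiable sample of the conditioning vector
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0)
    return c, kl

phi_t = rng.standard_normal(EMB_DIM)  # mock text embedding
c, kl = conditioning_augmentation(phi_t)
print(c.shape, kl >= 0.0)
```

Because `c` is sampled rather than fixed, the same text description yields varied conditioning vectors, which is one way such a scheme can improve sample diversity while the KL term stabilizes training.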